Abstract. Gene regulation in Eukaryotes is mainly effected through transcription 
factors binding to rather short recognition motifs generally located upstream of 
the coding region. We present a novel computational method to identify regulatory 
elements in the upstream region of Eukaryotic genes. The genes are grouped in sets 
sharing an overrepresented short motif in their upstream sequence. For each set, 
the average expression level from a microarray experiment is determined: if this 
level is significantly higher or lower than the average taken over the whole genome, 
then the overrepresented motif shared by the genes in the set is likely to play a 
role in their regulation. We illustrate the method by applying it to the genome 
of S. cerevisiae, for which many datasets of microarray experiments are publicly 
available. Several known binding motifs are correctly recognized by our algorithm, 
and a new candidate is suggested for experimental verification. 

1 Introduction 

One of the most interesting subjects of computational genomics is the char- 
acterization of the network of regulatory interactions between genes. A dom- 
inant role in this web of interactions is played by the mechanism of transcrip- 
tional regulation, in which one or more gene products regulate the transcrip- 
tion of other genes into mRNA, by binding to recognition motifs (short DNA 
sequences) generally located upstream of the coding region of the regulated 
gene. A key step in the study of transcriptional regulation is the identification 
of such recognition motifs. 

In recent years, it has become possible to attack this problem on a ge- 
nomic scale with computational methods, thanks to the huge amount of data 
generated by modern experimental techniques. In particular, two sets of ex- 
perimental data are of utmost importance to the study of transcriptional 
regulation: first, the availability of the complete, annotated DNA sequence 
for many model organisms allows the systematic exploration of the upstream 
regions where the motifs are to be found; second, microarray techniques 
allow the identification of groups of coregulated genes. 

When trying to identify regulatory motifs within a computational approach,
one is interested in finding instances of statistical correlation between
the presence of certain sequences in the upstream region and common patterns
of expression. This search is facilitated by the fact that, in many cases,
the regulatory motif must be repeated several times in the upstream region to
be effective, and therefore the number of its appearances in the upstream
region is above what is expected from chance alone. The most common approach
to the problem consists of two steps: first, a set of co-expressed genes is
identified, either using direct experimental knowledge or by applying
clustering techniques to microarray data; then the upstream regions of such
genes are systematically analyzed, looking for motifs that are
overrepresented in some statistically precise sense. Examples of this
procedure as applied to yeast (Saccharomyces cerevisiae) are Refs. [1,2,3].

In this work we propose an alternative procedure, which in some sense
reverses these two steps: the genes are grouped based on the short sequences
that are overrepresented in their upstream region, and such groups are then 
tested for evidence of coregulation, using microarray data. Whenever such 
evidence of coregulation is found, the sequence characterizing the upstream 
region of the genes is a candidate regulatory site. The most attractive feature 
of this reversed procedure is that the grouping of the genes is based on the 
sequence only: the sets of genes thus obtained can be analysed for evidence 
of coregulation not only by using microarray data, as described in this
contribution, but also by comparing them to other biological data, such as
functional annotation and the composition of protein complexes.

A somewhat related method, which does not rely on previous clustering of
the genes, is described in Ref. [4]. In this contribution we focus on
describing the method: for a detailed discussion of the results and a
comparison with other methods, we refer the interested reader to the original
paper, Ref. [5].

2 Grouping the genes based on overrepresented upstream motifs

The first step in our procedure consists in analyzing the upstream regions of
all the ~6000 yeast genes and identifying, for each gene, the short DNA
sequences (words) which are overrepresented in its upstream region.
Overrepresentation of a word in the upstream sequence of a gene is defined by
comparison to the prevalence of the same word in all the upstream regions
taken as a single sample.

Specifically, for each possible word w (of length 6-8 in the present study) 
we construct the set S(w) of the genes in whose upstream region w is over- 
represented by the following procedure: 

• For each yeast gene $g$ we compute the number $m_g(w)$ of occurrences of $w$
in the upstream region of $g$. The length of the upstream region we consider
is 500 base pairs, but it is shortened whenever necessary to avoid overlapping
with the coding region of the preceding open reading frame (ORF). Therefore
the length of the upstream region depends on the gene considered, and will
be denoted by $K_g$. Non-palindromic words are counted on both strands:
therefore we define the effective number of occurrences $n_g(w)$ as

$$n_g(w) = m_g(w) + m_g(\bar{w}) \quad \text{if } w \neq \bar{w} \tag{1}$$
$$n_g(w) = m_g(w) \quad \text{if } w = \bar{w} \tag{2}$$

where $\bar{w}$ is the reverse complement of $w$.

• We define the global frequency $p(w)$ of each word $w$ as

$$p(w) = \frac{\sum_g n_g(w)}{\sum_g L_g(w)} \tag{3}$$

where, in order to count correctly the available space for palindromic and
non-palindromic words,

$$L_g(w) = 2(K_g - l + 1) \quad \text{if } w \neq \bar{w} \tag{4}$$
$$L_g(w) = K_g - l + 1 \quad \text{if } w = \bar{w} \tag{5}$$

with $l$ the length of $w$. $p(w)$ is therefore the frequency with which
the word $w$ appears in the upstream regions of the whole genome: it is
the "background frequency" against which occurrences in the upstream
regions of the individual genes are compared to determine which words
are overrepresented.

• For each ORF $g$ and each word $w$ we compute the probability $b_g(w)$ of
finding $n_g(w)$ or more occurrences of $w$ based on the global frequency
$p(w)$:

$$b_g(w) = \sum_{n = n_g(w)}^{L_g(w)} \binom{L_g(w)}{n} \, p(w)^n \, [1 - p(w)]^{L_g(w) - n} \tag{6}$$

• We define a maximum probability $P$ and consider, for each $w$, the set

$$S(w) = \{ g : b_g(w) < P \} \tag{7}$$

of the ORFs in which the word $w$ is overrepresented compared to the
frequency of $w$ in the upstream regions of the whole genome. That is, $w$ is
considered overrepresented in the upstream region of $g$ if the probability
of finding $n_g(w)$ or more instances of $w$ based on the global frequency is
less than $P$. In this study we chose $P = 0.02$. Note that this is a rather
lenient cutoff. However, no biological significance is attributed to the set
$S(w)$ in itself: only those sets which pass the test for coregulation,
described in the following section, will be considered of biological
relevance, and hence it is on this second test that a stringent cutoff on the
P-value must be imposed.
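The grouping procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in this study: the function names and the dictionary `upstream` (gene name mapped to its upstream sequence) are hypothetical, and we assume that overlapping occurrences of a word are counted separately, so that every start position is one trial of the binomial test.

```python
from math import comb

# Translation table used to complement a DNA string.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Return the reverse complement of a DNA word."""
    return word.translate(COMPLEMENT)[::-1]


def count_occurrences(word, seq):
    """m_g(w): occurrences of word in seq (overlaps counted; an assumption)."""
    return sum(1 for i in range(len(seq) - len(word) + 1)
               if seq.startswith(word, i))


def effective_count(word, seq):
    """n_g(w): count on both strands unless the word is palindromic."""
    rc = reverse_complement(word)
    n = count_occurrences(word, seq)
    if rc != word:
        n += count_occurrences(rc, seq)
    return n


def available_positions(word, seq):
    """L_g(w): number of start positions, doubled for non-palindromic words."""
    slots = max(len(seq) - len(word) + 1, 0)
    return slots if reverse_complement(word) == word else 2 * slots


def global_frequency(word, upstream):
    """p(w): total occurrences over total available positions, all genes."""
    total_n = sum(effective_count(word, s) for s in upstream.values())
    total_l = sum(available_positions(word, s) for s in upstream.values())
    return total_n / total_l if total_l else 0.0


def binomial_tail(n_min, trials, p):
    """b_g(w): probability of n_min or more successes in `trials` trials."""
    return sum(comb(trials, n) * p ** n * (1 - p) ** (trials - n)
               for n in range(n_min, trials + 1))


def overrepresented_set(word, upstream, cutoff=0.02):
    """S(w): genes whose upstream region is enriched in `word`."""
    p = global_frequency(word, upstream)
    return {g for g, s in upstream.items()
            if binomial_tail(effective_count(word, s),
                             available_positions(word, s), p) < cutoff}
```

For instance, with one toy gene whose upstream sequence is three copies of GATGAG and nine toy genes with a poly-A upstream sequence of the same length, only the first gene enters S(GATGAG) at the cutoff P = 0.02. A gene with no occurrence of the word always has tail probability 1 and is therefore never included.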



3 Looking for coregulation within the sets S(w) 



The second step of the procedure consists in verifying whether the genes 
contained in each set S(w) show evidence of coregulation in their expression 
levels as measured in microarray experiments. This is done by computing 
the average expression level of the genes in the set and comparing it to the 
average expression level for the same experiment computed over the whole 
genome. We used the data from the diauxic shift experiment of Ref. [6], which
provides us with expression measurements for virtually all the yeast genes
at 7 timepoints corresponding to progressive depletion of the glucose in the
medium, with the consequent metabolic shift from fermentation to respiration.
We proceed as follows:

• For each time-point $i$ we compute the genome-wide average expression
$R(i)$ and its standard deviation $\sigma(i)$.

• Then for each word $w$ we compute the average expression in the subset
of $S_w$ given by the genes for which an experimental result is available at
timepoint $i$ (in most cases this coincides with $S_w$):

$$R_w(i) = \frac{1}{N(i,w)} \sum_{g \in S_w} r_g(i) \tag{8}$$

where $N(i,w)$ is the number of ORFs in $S_w$ for which an experimental
result at timepoint $i$ is available, and $r_g(i)$ is the corresponding
expression level (as customary, this is defined as the $\log_2$ of the ratio
between the mRNA level at timepoint $i$ and the initial mRNA level).

• The difference

$$\Delta R_w(i) = R_w(i) - R(i) \tag{9}$$

represents the discrepancy between the genome-wide average expression
at time-point $i$ and the average expression at the same time-point of the
ORFs that share an abundance of the word $w$ in their upstream region.

• A significance index $\mathrm{sig}(i,w)$ is defined as

$$\mathrm{sig}(i,w) = \frac{\Delta R_w(i)}{\sigma(i)} \sqrt{N(i,w)} \tag{10}$$

and the word $w$ is considered significantly correlated with expression at
time point $i$ if

$$|\mathrm{sig}(i,w)| > A \tag{11}$$

In this work we chose $A = 6$: this means that we consider meaningful
a deviation of $R_w(i)$ by six s.d.'s from its expected value, the s.d. of
an average over $N(i,w)$ genes being $\sigma(i)/\sqrt{N(i,w)}$. The sign of
$\mathrm{sig}(i,w)$ indicates whether $w$ acts as an enhancer or an inhibitor
of gene expression.
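The test above can be sketched for a single timepoint as follows. This is a minimal illustration, not the code used in this study: the function names and the input layout (a dictionary from gene name to log2 expression ratio, with unmeasured genes simply absent) are hypothetical, the population standard deviation is assumed for the genome-wide spread, and the significance index is taken to be the deviation of the set average from the genome-wide average in units of the standard error of the set average.

```python
from math import sqrt
from statistics import mean, pstdev


def significance(expr, gene_set):
    """Significance index of a gene set at one timepoint.

    expr     -- dict: gene -> log2 expression ratio at this timepoint
                (genes without a measurement are absent; an assumption)
    gene_set -- the set of genes sharing an overrepresented word
    """
    values = list(expr.values())
    genome_avg = mean(values)      # genome-wide average
    genome_sd = pstdev(values)     # genome-wide s.d. (population form assumed)
    measured = [expr[g] for g in gene_set if g in expr]
    n = len(measured)              # genes of the set with a measurement
    if n == 0 or genome_sd == 0:
        return 0.0                 # nothing to test
    delta = mean(measured) - genome_avg
    return delta / genome_sd * sqrt(n)


def is_significant(expr, gene_set, threshold=6.0):
    """True if the absolute significance index exceeds the threshold."""
    return abs(significance(expr, gene_set)) > threshold
```

A positive value means the genes in the set are on average induced relative to the genome, a negative value that they are repressed. Running the function once per timepoint gives the list of timepoints at which a word passes the test.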



4 Results 



A total of 29 words of length between 6 and 8 base pairs pass our significance
test for at least one timepoint in the diauxic shift experiment. Two of
them must be eliminated, as their statistical significance can be ascribed to
the existence of a family of several nearly identical ORFs with nearly
identical upstream regions. Out of the remaining 27 words, 26 can be
confidently identified with a known regulatory motif, and one word is a new
candidate. Examples of significant words, including the new candidate
ATAAGGG, are reported in Tab. 1.

Table 1. Examples of regulatory words found by our method, together with the
known regulatory motif they belong to. In the third column we report the
timepoint(s) of the diauxic shift experiment where the word passes the
significance test. The sign in the fourth column indicates whether the genes
in the set S(w) are induced or repressed at these timepoints.



word       motif   timepoints   +/-
GATGAG     PAC     4            -
GATGAGAT   PAC     4,5,6,7      -
GATGAGA    PAC     4,7          -
AAAATTT    RRPE    6,7          -
AAAATTTT   RRPE    4,6,7        -
CCACCCCC   STRE    6            +
CCCCCCCT   STRE    6            +
TACCCC     MIG1    6            +
GCCGCC     UME6    7            +
ATAAGGG    new     6,7          +

In Figs. 1 and 2 we give an example of coregulation in the set S(w)
corresponding to the significant word GATGAG.

5 Conclusions 

Our method appears to be very efficient in identifying regulatory motifs whose
effectiveness depends on many repetitions of a short sequence in the upstream
region of the regulated genes. The main feature that differentiates our method
from existing algorithms for motif discovery is the fact that genes are grouped
a priori based on similarities in their upstream sequences. The fact that most
of the motifs revealed by the method turn out to be known regulatory motifs
suggests a very low rate of false positives. A computational method in which
candidate binding sites are identified using statistical correlations between
upstream sequence and expression data was also proposed in Ref. [4]. Two
important differences can be identified between our method and the one
proposed in Ref. [4] (see Ref. [5] for a detailed comparison of the results):
first, our algorithm does not assume a linear dependence of the expression
level on the number of repetitions of the motif. Second, our sets of genes,
being based on the upstream sequence only, can be screened for evidence of
coregulation not only by comparison with expression data, as described above,
but also with other types of biological datasets, such as functional
annotation or experimentally determined protein complexes.

Fig. 1. Expression of the genes in the set S(GATGAG): the average expression of
the genes in the set (solid line) is compared to the genome-wide average
expression (dashed line) at the seven timepoints of the diauxic shift
experiment.

Fig. 2. Statistical significance sig(i, w) as defined in Eq. (10) for the word
w = GATGAG and timepoints i = 1, ..., 7 in the diauxic shift experiment. The
dashed line is the significance threshold |sig| = 6.